Python Project A4: Diabetes in US hospitals from 1999 to 2008

Python for data analysis - DIA 2

Richard Goudelin

Table of Content (Clickable)

  • 1. Overview
  • 2. Preprocessing
    • a) Libraries Import
    • b) Data Import
    • c) Data Exploration
    • d) Data Cleaning
      • (i) Missing Values
      • (ii) Multiple Encounters
      • (iii) Useless Columns
      • (iv) Missing Genders
      • (v) Mapping the Data
      • (vi) Data Types
      • (vii) Re-Writing Data
      • (viii) Eliminating Zero Variance Data
      • (ix) Mapping the Diag Columns
      • (x) Categorical Data
      • (xi) Encoding Data
      • (xii) Custom Variables
      • (xiii) Feature Engineering - chi2
      • (xiv) Data Balancing
      • (xv) End of Preprocessing
  • 3. Distribution Plots
  • 4. Machine Learning Models
    • a) Evaluating Models: Linear Regression, RandomForest and DecisionTree Classifiers
    • b) Feature Importances and Coefficients
    • c) Grid Search: Finding the best hyperparameters
    • d) Re-evaluating the models
  • 5. Regression Model Study
    • a) Evaluating Models for regression
    • a1. Little verification for the most important feature
    • b) Improve the data with PCA
    • c) Find hyperparameters for the best model (Gradient boosting)
  • 6. Conclusion

1. Project Overview

This project dives into an in-depth analysis of diabetes-related hospital admissions in the US, spanning a decade from 1999 to 2008. The core focus is on evaluating and understanding the patterns and trends in patient readmissions. Through a detailed examination of a comprehensive dataset, we shed light on various elements, including patient demographics, hospital attributes, treatment efficacy, and, crucially, the rates and determinants of readmission among diabetes patients.

Objectives

  • Examine key factors contributing to the readmission of patients with diabetes.
  • Investigate decade-long trends in diabetes treatments and patient outcomes.

Methodology

Utilizing Python for analysis, we employed data manipulation tools like Pandas, Matplotlib and Seaborn for data visualization. Our dataset, obtained from the US Department of Health, encompasses over 100,000 records of hospital admissions. Advanced Classifier models such as Linear Regression, Random Forest, and Decision Tree from scikit-learn were used.

Reflections

This study brings to light the critical challenges in managing diabetes in the US healthcare context. It underscores the urgent need for healthcare policy reforms targeted at reducing the frequency of readmissions. Our findings advocate for enhanced resource allocation toward preventive care and effective diabetes management strategies.


2. Preprocessing


a. Librairies Import

In [2]:
import numpy as np
import pandas as pd

#For Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

#For plots
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from matplotlib import rcParams

import warnings
warnings.filterwarnings("ignore")

b. Data Import

In [3]:
pd.set_option('display.max_columns', None)

data = pd.read_csv('diabetic_data.csv', delimiter=',')
ids = pd.read_csv('IDS_mapping.csv', delimiter=',', header=None)

c. Data Exploration

In [4]:
data.head()
Out[4]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ? Pediatrics-Endocrinology 41 0 1 0 0 0 250.83 ? ? 1 NaN NaN No No No No No No No No No No No No No No No No No No No No No No No No No NO
1 149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ? ? 59 0 18 0 0 0 276 250.01 255 9 NaN NaN No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes >30
2 64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ? ? 11 5 13 2 0 1 648 250 V27 6 NaN NaN No No No No No No Steady No No No No No No No No No No No No No No No No No Yes NO
3 500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ? ? 44 1 16 0 0 0 8 250.43 403 7 NaN NaN No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes NO
4 16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ? ? 51 0 8 0 0 0 197 157 250 5 NaN NaN No No No No No No Steady No No No No No No No No No No Steady No No No No No Ch Yes NO
In [5]:
print("Number of Lines", data.shape[0])
print("Number of Columns", data.shape[1])
Number of Lines 101766
Number of Columns 50
In [6]:
ids.head()
Out[6]:
0 1
0 admission_type_id description
1 1 Emergency
2 2 Urgent
3 3 Elective
4 4 Newborn
In [7]:
print("Number of Lines", ids.shape[0])
print("Number of Columns", ids.shape[1])
Number of Lines 68
Number of Columns 2

d. Data Cleaning

(i) Missing Values

</a>

In [8]:
# Assessing missing values
missing_values = data.replace('?', pd.NA).isna().sum()

# Percentage of missing values
missing_percentage = (missing_values / len(data)) * 100

missing_summary = pd.DataFrame({'Number of Missing Values': missing_values, 'Percentage': missing_percentage})
missing_summary.sort_values(by='Percentage', ascending=False)
Out[8]:
Number of Missing Values Percentage
weight 98569 96.858479
max_glu_serum 96420 94.746772
A1Cresult 84748 83.277322
medical_specialty 49949 49.082208
payer_code 40256 39.557416
race 2273 2.233555
diag_3 1423 1.398306
diag_2 358 0.351787
diag_1 21 0.020636
encounter_id 0 0.000000
troglitazone 0 0.000000
tolbutamide 0 0.000000
pioglitazone 0 0.000000
rosiglitazone 0 0.000000
acarbose 0 0.000000
miglitol 0 0.000000
citoglipton 0 0.000000
tolazamide 0 0.000000
examide 0 0.000000
glipizide 0 0.000000
insulin 0 0.000000
glyburide-metformin 0 0.000000
glipizide-metformin 0 0.000000
glimepiride-pioglitazone 0 0.000000
metformin-rosiglitazone 0 0.000000
metformin-pioglitazone 0 0.000000
change 0 0.000000
diabetesMed 0 0.000000
glyburide 0 0.000000
repaglinide 0 0.000000
acetohexamide 0 0.000000
glimepiride 0 0.000000
gender 0 0.000000
age 0 0.000000
admission_type_id 0 0.000000
discharge_disposition_id 0 0.000000
admission_source_id 0 0.000000
time_in_hospital 0 0.000000
num_lab_procedures 0 0.000000
num_procedures 0 0.000000
num_medications 0 0.000000
number_outpatient 0 0.000000
number_emergency 0 0.000000
number_inpatient 0 0.000000
number_diagnoses 0 0.000000
metformin 0 0.000000
patient_nbr 0 0.000000
nateglinide 0 0.000000
chlorpropamide 0 0.000000
readmitted 0 0.000000

Observations

- Weight is missing though important

- Medical specialty: Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon + useless

- Payer code: Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay + useless

- Max glu serum: Indicates the range of the result or if the test was not taken. Values: >200, >300, normal, and none if not measured

- A1Cresult: Indicates the range of the result or if the test was not taken. Values: >8 if the result was greater than 8%, >7 if the result was greater than 7% but less than 8%, normal if the result was less than 7%, and none if not measured.

In [9]:
data.replace('?', pd.NA, inplace=True)
columns_to_drop = ['weight', 'medical_specialty', 'payer_code','max_glu_serum','A1Cresult']
data = data.drop(columns=columns_to_drop)


cols = ['race', 'diag_1', 'diag_2', 'diag_3']
data.dropna(subset=cols, inplace=True)

(ii) Multiple Encounters

In [10]:
nombre_redondant = data.duplicated(subset=['patient_nbr']).sum()

data = data.drop_duplicates(subset=['patient_nbr'])

print("Nombre de patient_nbr redondants :", nombre_redondant)
Nombre de patient_nbr redondants : 29423

(iii) Useless Columns

In [11]:
columns_to_drop = ['encounter_id', 'patient_nbr', 'payer_code']

#We drop the columns only if they exist in the DataFrame
data = data.drop([col for col in columns_to_drop if col in data.columns], axis=1)

(iv) Missing Genders

We need to find out the number of undeclared genders, and then we will figure out what to do with them. Here, there are only 3 missing genders so we will just drop them.

Drop the variables with unknow gender

In [12]:
nombre_unknown = (data['gender'] == 'Unknown/Invalid').sum()


data = data[data['gender'] != 'Unknown/Invalid']


print("Nombre de 'unknown' dans la colonne 'gender':", nombre_unknown)
Nombre de 'unknown' dans la colonne 'gender': 1

(v) Mapping the data

We have several mapping

In [13]:
#mapping admissions id 
mapping_admission_id = {1.0:"Emergency",
          2.0:"Emergency",
          3.0:"Elective",
          4.0:"New Born",
          5.0:np.nan,
          6.0:np.nan,
          7.0:"Trauma Center",
          8.0:np.nan}

data.admission_type_id = data.admission_type_id.replace(mapping_admission_id)
In [14]:
plt.figure(figsize=(10, 6))
custom_palette = sns.color_palette("Reds", n_colors=len(data["admission_type_id"].unique()))
sns.countplot(x="admission_type_id", data=data, palette=custom_palette)
plt.xticks(rotation=90)
plt.show()
In [15]:
#mapping admission source id 
mapping_admission_source_id = {1:"Referral",2:"Referral",3:"Referral",
              4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
              9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
              15:np.nan,17:np.nan,20:np.nan,21:np.nan,
              7:"Emergency"}
data.admission_source_id = data.admission_source_id.replace(mapping_admission_source_id)
In [16]:
plt.figure(figsize=(10, 6))
custom_palette = sns.color_palette("pastel", n_colors=len(data["admission_source_id"].unique()))
sns.countplot(x="admission_source_id", data=data, palette=custom_palette)
plt.xticks(rotation=90)
plt.show()
In [17]:
#mapping discharge disposition
mapping_discharge_disp= {1:"Referral",2:"Referral",3:"Referral",
              4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
              9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
              15:np.nan,17:np.nan,20:np.nan,21:np.nan,
              7:"Emergency"}
data.discharge_disposition_id = data.discharge_disposition_id.replace(mapping_discharge_disp)
In [18]:
plt.figure(figsize=(6, 6))
custom_palette = sns.color_palette("pastel", n_colors=len(data["discharge_disposition_id"].unique()))
sns.countplot(x="discharge_disposition_id", data=data, palette=custom_palette)
plt.xticks(rotation=90)
plt.show()
In [19]:
#mapping admission type id
mapping_admission_type_id= {1:"Emergency",2:"Urgent",3:"Elective",
              4:"Newborn",5:"Not Available",6:"NULL",7:"Trauma Center",8:"Not Mapped"}
data.admission_type_id = data.admission_type_id.replace(mapping_admission_type_id)

(vi) Data Types

In [20]:
data['diag_1'] = pd.to_numeric(data['diag_1'], errors='coerce')
data['diag_2'] = pd.to_numeric(data['diag_2'], errors='coerce')
data['diag_3'] = pd.to_numeric(data['diag_3'], errors='coerce')

data['diag_1'] = data['diag_1'].astype(float)
data['diag_2'] = data['diag_2'].astype(float)
data['diag_3'] = data['diag_3'].astype(float)
In [21]:
cols = ['diag_1', 'diag_2', 'diag_3']
data.dropna(subset=cols, inplace=True)

(vii) Re-Writing Data

We do this in the purpose of having simpler and adjusted datasets.

In [22]:
gender_types = data['age'].unique()
print("Différents types dans la colonne 'gender':")
print(gender_types)
Différents types dans la colonne 'gender':
['[10-20)' '[30-40)' '[40-50)' '[50-60)' '[70-80)' '[80-90)' '[90-100)'
 '[60-70)' '[20-30)' '[0-10)']
In [23]:
replaceDict = {'[0-10)' : 5,
'[10-20)' : 15,
'[20-30)' : 25, 
'[30-40)' : 35, 
'[40-50)' : 45, 
'[50-60)' : 55,
'[60-70)' : 65, 
'[70-80)' : 75,
'[80-90)' : 85,
'[90-100)' : 95}

data['age'] = data['age'].apply(lambda x : replaceDict[x])
print(data['age'].head())
1    15
3    35
4    45
5    55
7    75
Name: age, dtype: int64
In [24]:
colonnes = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
            'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
            'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone',
            'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed']

for colonne in colonnes:
    valeurs_uniques = data[colonne].unique()
    print("Colonnes :", colonne)
    print("Différentes valeurs uniques :")
    print(valeurs_uniques)
    print()
Colonnes : metformin
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : repaglinide
Différentes valeurs uniques :
['No' 'Up' 'Steady' 'Down']

Colonnes : nateglinide
Différentes valeurs uniques :
['No' 'Steady' 'Down' 'Up']

Colonnes : chlorpropamide
Différentes valeurs uniques :
['No' 'Steady' 'Down' 'Up']

Colonnes : glimepiride
Différentes valeurs uniques :
['No' 'Steady' 'Down' 'Up']

Colonnes : acetohexamide
Différentes valeurs uniques :
['No' 'Steady']

Colonnes : glipizide
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : glyburide
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : tolbutamide
Différentes valeurs uniques :
['No' 'Steady']

Colonnes : pioglitazone
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : rosiglitazone
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : acarbose
Différentes valeurs uniques :
['No' 'Steady' 'Up']

Colonnes : miglitol
Différentes valeurs uniques :
['No' 'Steady' 'Down' 'Up']

Colonnes : troglitazone
Différentes valeurs uniques :
['No' 'Steady']

Colonnes : tolazamide
Différentes valeurs uniques :
['No' 'Steady']

Colonnes : examide
Différentes valeurs uniques :
['No']

Colonnes : citoglipton
Différentes valeurs uniques :
['No']

Colonnes : insulin
Différentes valeurs uniques :
['Up' 'Steady' 'No' 'Down']

Colonnes : glyburide-metformin
Différentes valeurs uniques :
['No' 'Steady' 'Down' 'Up']

Colonnes : glipizide-metformin
Différentes valeurs uniques :
['No' 'Steady']

Colonnes : glimepiride-pioglitazone
Différentes valeurs uniques :
['No']

Colonnes : metformin-rosiglitazone
Différentes valeurs uniques :
['No']

Colonnes : metformin-pioglitazone
Différentes valeurs uniques :
['No' 'Steady']

Colonnes : change
Différentes valeurs uniques :
['Ch' 'No']

Colonnes : diabetesMed
Différentes valeurs uniques :
['Yes' 'No']

(viii) Eliminating Data with Zero Variance

As we can see just above there are data with only one data possible (it mean that for everyone in the data set this data value is the same => it is useless) so we will be looking at zero variance data(data that don't change a lot will be removed)

In [25]:
from sklearn.feature_selection import VarianceThreshold

encoded_colonnes = data[colonnes].copy()

#Replace values in the selected columns
replacement_dict = {'Up': 2, 'Steady': 1, 'No': -1, 'Down': -2,'Ch' : 1, 'Yes' : 1}
encoded_colonnes.replace(replacement_dict, inplace=True)


#Apply VarianceThreshold to remove near-zero variance columns
threshold = 0.05  # Set the threshold as desired (adjust as needed)
selector = VarianceThreshold(threshold=threshold)
filtered_colonnes = selector.fit_transform(encoded_colonnes)

selected_columns = encoded_colonnes.columns[selector.get_support()]
print("Here are the column that we will be keeping : " , list(selected_columns))

toDelete=list(set(colonnes)-set(selected_columns))
print("We will delete : ", len(toDelete), " columns !")

data = data.drop([col for col in toDelete if col in data.columns], axis=1)
Here are the column that we will be keeping :  ['metformin', 'repaglinide', 'glimepiride', 'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'insulin', 'change', 'diabetesMed']
We will delete :  15  columns !

(ix) Mapping Diag Columns

In the paper given we can see that each number within diag is corresponding to a type of medecine example 391 is for circulatory issues

In [26]:
def mapping_diag(data, col):
    conditions = [
        ((data[col] >= 390) & (data[col] < 460)) | (data[col] == 785),
        ((data[col] >= 460) & (data[col] < 520)) | (data[col] == 786),
        ((data[col] >= 520) & (data[col] < 580)) | (data[col] == 787),
        (data[col] == 250),
        ((data[col] >= 800) & (data[col] <= 1000)),
        ((data[col] >= 710) & (data[col] <= 740)),
        ((data[col] >= 580) & (data[col] < 630)) | (data[col] == 788),
        ((data[col] >= 140) & (data[col] <= 240)),
        ((data[col] >= 630) & (data[col] <= 680))
    ]
    
    choices = [
        'Circulatory',
        'Respiratory',
        'Digestive',
        'Diabetes',
        'Injury',
        'Musculoskeletal',
        'Genitourinary',
        'Neoplasms',
        'Pregnancy'
    ]
    
    data[col] = np.select(conditions, choices, default='Other')
    
    return data
In [27]:
data=mapping_diag(data,"diag_1")
data=mapping_diag(data,"diag_2")
data=mapping_diag(data,"diag_3")
In [28]:
count_other = data[data['diag_1'] == 'Other'].shape[0]
print("Number of 'Other' occurrences in diag_1 column:", count_other)

count_other = data[data['diag_2'] == 'Other'].shape[0]
print("Number of 'Other' occurrences in diag_1 column:", count_other)

count_other = data[data['diag_3'] == 'Other'].shape[0]
print("Number of 'Other' occurrences in diag_1 column:", count_other)
Number of 'Other' occurrences in diag_1 column: 14269
Number of 'Other' occurrences in diag_1 column: 19067
Number of 'Other' occurrences in diag_1 column: 19186

(x) Categorical Data

Transforming categorical columns into a data_cleaned dataframe

In [29]:
data_cleaned=data
In [30]:
#mapping des admissions id 
mapping_admission_id = {1.0:"Emergency",
          2.0:"Emergency",
          3.0:"Elective",
          4.0:"New Born",
          5.0:np.nan,
          6.0:np.nan,
          7.0:"Trauma Center",
          8.0:np.nan}

data_cleaned.admission_type_id = data.admission_type_id.replace(mapping_admission_id)
In [31]:
#mapping des admission source id 
data_cleaned=data
mapping_admission_source_id = {1:"Referral",2:"Referral",3:"Referral",
              4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
              9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
              15:np.nan,17:np.nan,20:np.nan,21:np.nan,
              7:"Emergency"}
data_cleaned.admission_source_id = data_cleaned.admission_source_id.replace(mapping_admission_source_id)
In [32]:
#mapping des discharge disposition
mapping_discharge_disp= {1:"Referral",2:"Referral",3:"Referral",
              4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
              9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
              15:np.nan,17:np.nan,20:np.nan,21:np.nan,
              7:"Emergency"}
data_cleaned.discharge_disposition_id = data_cleaned.discharge_disposition_id.replace(mapping_discharge_disp)
In [33]:
cols = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'diag_3']
data.dropna(subset=cols, inplace=True)

(xi) Encoding Data

Transforming non-numerical columns into a data_encoded dataframe

In [34]:
for colonne in data.columns:
    valeurs_uniques = data[colonne].unique()
    print("Colonnes :", colonne)
    print("Différentes valeurs uniques :")
    print(valeurs_uniques)
    print()
Colonnes : race
Différentes valeurs uniques :
['Caucasian' 'AfricanAmerican' 'Other' 'Asian' 'Hispanic']

Colonnes : gender
Différentes valeurs uniques :
['Female' 'Male']

Colonnes : age
Différentes valeurs uniques :
[15 35 45 55 75 85 95 65 25  5]

Colonnes : admission_type_id
Différentes valeurs uniques :
['Emergency' 'Elective' 'New Born' 'Trauma Center']

Colonnes : discharge_disposition_id
Différentes valeurs uniques :
['Referral' 'Other' 'Emergency' 18 12 23 28 24 19 27]

Colonnes : admission_source_id
Différentes valeurs uniques :
['Emergency' 'Referral' 'Other']

Colonnes : time_in_hospital
Différentes valeurs uniques :
[ 3  2  1  5 13 12  9  7 10  4  6 11  8 14]

Colonnes : num_lab_procedures
Différentes valeurs uniques :
[ 59  44  51  31  73  68  33  47  62  60  55  49  75  45  29  35  19  64
  25  53  52  87  27  37  46  41  28  36  72  10   2  65  67  40  58  57
  32  34  39  69  56  22  96  78  48  70  66  18   9  24  63  76  90  77
  80  71  50  38  43  61  42  84  74  23  54  95  83  30  12  79  13  81
  15   1  86   5  26   6   8  11 100  21  88   7  89  91  16  94  82  93
  14 105  97   4 101  17  98  20  85   3  92 102 108 106  99 113 109 111
 103 132 121 118]

Colonnes : num_procedures
Différentes valeurs uniques :
[0 1 6 2 3 5 4]

Colonnes : num_medications
Différentes valeurs uniques :
[18 16  8 12 28 17 11 15 31  2 13 23  7 20 14 10 22  9 27 19 25 21  6 30
 24 33 29  4 40  5 46 41 36 26 43 42 34 51 35  3 45 37 32 38  1 54 49 39
 52 47 44 48 53 62 50 56 61 59 57 55 58 70 67 63 64 60 69 65 66 68 81 79
 75 72 74]

Colonnes : number_outpatient
Différentes valeurs uniques :
[ 0  1  5  2  7  3  8  4 12 11  6 20 15 14  9 10 21 16 29 13 18 19 22 24
 42 33 17 25 23]

Colonnes : number_emergency
Différentes valeurs uniques :
[ 0  1  4  2  3  6  8  7 13  5 42 10  9 25 11 37 20 16]

Colonnes : number_inpatient
Différentes valeurs uniques :
[ 0  2  1  3  5  4  6  8 11  7  9 10 12]

Colonnes : diag_1
Différentes valeurs uniques :
['Other' 'Neoplasms' 'Circulatory' 'Respiratory' 'Injury' 'Genitourinary'
 'Musculoskeletal' 'Digestive' 'Pregnancy' 'Diabetes']

Colonnes : diag_2
Différentes valeurs uniques :
['Other' 'Neoplasms' 'Circulatory' 'Respiratory' 'Injury' 'Genitourinary'
 'Digestive' 'Diabetes' 'Musculoskeletal' 'Pregnancy']

Colonnes : diag_3
Différentes valeurs uniques :
['Other' 'Circulatory' 'Diabetes' 'Respiratory' 'Injury' 'Neoplasms'
 'Genitourinary' 'Digestive' 'Musculoskeletal' 'Pregnancy']

Colonnes : number_diagnoses
Différentes valeurs uniques :
[ 9  7  5  8  3  6  4 16 12 13 15 11 14 10]

Colonnes : metformin
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : repaglinide
Différentes valeurs uniques :
['No' 'Up' 'Steady' 'Down']

Colonnes : glimepiride
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : glipizide
Différentes valeurs uniques :
['No' 'Steady' 'Down' 'Up']

Colonnes : glyburide
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : pioglitazone
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : rosiglitazone
Différentes valeurs uniques :
['No' 'Steady' 'Up' 'Down']

Colonnes : insulin
Différentes valeurs uniques :
['Up' 'Steady' 'No' 'Down']

Colonnes : change
Différentes valeurs uniques :
['Ch' 'No']

Colonnes : diabetesMed
Différentes valeurs uniques :
['Yes' 'No']

Colonnes : readmitted
Différentes valeurs uniques :
['>30' 'NO' '<30']

In [35]:
data_encoded = pd.DataFrame()
In [36]:
def categorize_readmission(value):
    if value == 'NO':
        return 0
    else:
        return 1

#Apply the function to the 'readmitted' column and create a new column with the results
data_encoded['readmitted'] = data['readmitted'].apply(categorize_readmission)
In [37]:
#Categorical columns for one-hot encoding
categorical_cols = ['gender', 'admission_type_id', 'discharge_disposition_id', 
                    'admission_source_id', 'diag_1', 'diag_2', 'diag_3', 
                    'metformin', 'repaglinide', 'glimepiride', 'glipizide', 
                    'glyburide', 'pioglitazone', 'rosiglitazone', 'insulin', 
                    'change', 'diabetesMed']

#Encode each categorical column
for col in categorical_cols:
    dummies = pd.get_dummies(data[col], prefix=col, drop_first=True)
    data_encoded = pd.concat([data_encoded, dummies], axis=1)

#Add numerical columns as they are
numerical_cols = ['age', 'time_in_hospital', 'num_lab_procedures', 'num_procedures', 
                  'num_medications', 'number_outpatient', 'number_emergency', 
                  'number_inpatient', 'number_diagnoses']
data_encoded = pd.concat([data_encoded, data[numerical_cols]], axis=1)
In [38]:
data_encoded.head(3)
Out[38]:
readmitted gender_Male admission_type_id_Emergency admission_type_id_New Born admission_type_id_Trauma Center discharge_disposition_id_18 discharge_disposition_id_19 discharge_disposition_id_23 discharge_disposition_id_24 discharge_disposition_id_27 discharge_disposition_id_28 discharge_disposition_id_Emergency discharge_disposition_id_Other discharge_disposition_id_Referral admission_source_id_Other admission_source_id_Referral diag_1_Diabetes diag_1_Digestive diag_1_Genitourinary diag_1_Injury diag_1_Musculoskeletal diag_1_Neoplasms diag_1_Other diag_1_Pregnancy diag_1_Respiratory diag_2_Diabetes diag_2_Digestive diag_2_Genitourinary diag_2_Injury diag_2_Musculoskeletal diag_2_Neoplasms diag_2_Other diag_2_Pregnancy diag_2_Respiratory diag_3_Diabetes diag_3_Digestive diag_3_Genitourinary diag_3_Injury diag_3_Musculoskeletal diag_3_Neoplasms diag_3_Other diag_3_Pregnancy diag_3_Respiratory metformin_No metformin_Steady metformin_Up repaglinide_No repaglinide_Steady repaglinide_Up glimepiride_No glimepiride_Steady glimepiride_Up glipizide_No glipizide_Steady glipizide_Up glyburide_No glyburide_Steady glyburide_Up pioglitazone_No pioglitazone_Steady pioglitazone_Up rosiglitazone_No rosiglitazone_Steady rosiglitazone_Up insulin_No insulin_Steady insulin_Up change_No diabetesMed_Yes age time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses
1 1 False True False False False False False False False False False False True False False False False False False False False True False False False False False False False False True False False False False False False False False True False False True False False True False False True False False True False False True False False True False False True False False False False True False True 15 3 59 0 18 0 0 0 9
3 0 True True False False False False False False False False False False True False False False False False False False False True False False False False False False False False True False False False False False False False False False False False True False False True False False True False False True False False True False False True False False True False False False False True False True 35 2 44 1 16 0 0 0 7
4 0 True True False False False False False False False False False False True False False False False False False False True False False False False False False False False True False False False True False False False False False False False False True False False True False False True False False False True False True False False True False False True False False False True False False True 45 1 51 0 8 0 0 0 5

(xii) Custom Variables

In [39]:
cols_to_sum = ['number_emergency', 'number_inpatient', 'number_outpatient', 
               'time_in_hospital', 'num_procedures', 'num_medications', 
               'num_lab_procedures', 'number_diagnoses']
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data[cols_to_sum]), 
                           columns=cols_to_sum, 
                           index=data.index)

data['health'] = data_scaled.sum(axis=1)

(xiii) Feature Engineering - Chi2 Test

In [40]:
object_columns = data.columns.tolist()
In [41]:
X=data[object_columns]
y = data['readmitted']
X_str = X.applymap(str)

le = LabelEncoder()
X_encoded = X_str.apply(le.fit_transform)
y_encoded = le.fit_transform(y)

scores, pvalues = chi2(X_encoded, y_encoded)

for col, p in sorted(zip(object_columns, pvalues),key = lambda x:x[1],reverse = True):
    print(f"{col}: {p}")
insulin: 0.9845946231347301
glimepiride: 0.9220213529372799
diag_2: 0.8587245323491828
repaglinide: 0.7638825309383867
glyburide: 0.5710481526624618
pioglitazone: 0.42117989442777937
race: 0.376899054575053
rosiglitazone: 0.3071210114680176
metformin: 0.2927136593791332
glipizide: 0.2763222149381983
gender: 0.03302510488129809
admission_type_id: 1.2247881503442352e-06
discharge_disposition_id: 1.0402444969718035e-06
diag_3: 1.0404460655681793e-08
change: 5.368607337761889e-10
diabetesMed: 2.947011238818089e-13
diag_1: 4.411480148584445e-19
age: 1.4331081387194604e-25
num_procedures: 6.066056972462838e-26
admission_source_id: 1.3290849176207581e-40
number_diagnoses: 9.526661076607155e-42
time_in_hospital: 2.915168094672371e-69
num_lab_procedures: 6.483474719457476e-147
num_medications: 0.0
number_outpatient: 0.0
number_emergency: 0.0
number_inpatient: 0.0
readmitted: 0.0
health: 0.0
In [42]:
pvalue_dict = dict(zip(object_columns, pvalues))

selected_columns = [col for col, p in pvalue_dict.items() if p <= 0.3]

dataForClassification= data[selected_columns]
In [43]:
dataForClassification.head()
Out[43]:
gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_3 number_diagnoses metformin glipizide change diabetesMed readmitted health
1 Female 15 Emergency Referral Emergency 3 59 0 18 0 0 0 Other Other 9 No No Ch Yes >30 -0.148877
3 Male 35 Emergency Referral Emergency 2 44 1 16 0 0 0 Other Circulatory 7 No No Ch Yes NO -1.994875
4 Male 45 Emergency Referral Emergency 1 51 0 8 0 0 0 Neoplasms Diabetes 5 No Steady Ch Yes NO -4.544908
5 Male 55 Emergency Referral Referral 3 31 6 16 0 0 0 Circulatory Diabetes 9 No No No Yes >30 1.529434
7 Male 75 Emergency Referral Emergency 5 73 0 12 0 0 0 Circulatory Diabetes 8 No No No Yes >30 -0.001356
In [44]:
df = data[data["readmitted"] == "NO"]
df=df.drop("readmitted", axis=1)
df.head()
Out[44]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses metformin repaglinide glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed health
3 Caucasian Male 35 Emergency Referral Emergency 2 44 1 16 0 0 0 Other Other Circulatory 7 No No No No No No No Up Ch Yes -1.994875
4 Caucasian Male 45 Emergency Referral Emergency 1 51 0 8 0 0 0 Neoplasms Neoplasms Diabetes 5 No No No Steady No No No Steady Ch Yes -4.544908
8 Caucasian Female 85 Emergency Referral Other 13 68 2 28 0 0 0 Circulatory Circulatory Other 8 No No No Steady No No No Steady Ch Yes 5.460957
9 Caucasian Female 95 Elective Referral Other 12 33 3 18 0 0 0 Circulatory Neoplasms Respiratory 8 No No No No No No Steady Steady Ch Yes 2.712900
13 Caucasian Male 85 Emergency Other Emergency 10 55 1 31 0 0 0 Circulatory Circulatory Circulatory 8 No No No No No No No Steady No Yes 3.581884
In [45]:
object_columns = df.columns.tolist()
In [46]:
X=df[object_columns]
y = df['time_in_hospital']
X_str = X.applymap(str)

le = LabelEncoder()
X_encoded = X_str.apply(le.fit_transform)
y_encoded = le.fit_transform(y)

scores, pvalues = chi2(X_encoded, y_encoded)

for col, p in sorted(zip(object_columns, pvalues),key = lambda x:x[1],reverse = True):
    print(f"{col}: {p}")
glimepiride: 0.9999991220928548
repaglinide: 0.9999985917546521
pioglitazone: 0.9999933010688921
rosiglitazone: 0.9999493250644083
glipizide: 0.9967681880425915
glyburide: 0.9957001425055398
race: 0.6071671998384902
metformin: 0.4294585077261339
gender: 0.004721128561146807
number_emergency: 5.693645553959853e-05
admission_type_id: 8.27761135567137e-07
diabetesMed: 1.029605224401174e-09
insulin: 2.120330812734523e-12
discharge_disposition_id: 1.8957265135220723e-20
diag_1: 2.4802638151403367e-24
admission_source_id: 9.635565747763743e-29
number_inpatient: 1.728516010165934e-30
change: 3.4701419693626655e-40
number_outpatient: 3.775637771913903e-58
age: 2.0007922774861345e-79
number_diagnoses: 1.5770478540342118e-139
diag_2: 1.6857616786074464e-167
diag_3: 2.473449034724946e-232
time_in_hospital: 0.0
num_lab_procedures: 0.0
num_procedures: 0.0
num_medications: 0.0
health: 0.0
In [47]:
pvalue_dict = dict(zip(object_columns, pvalues))

selected_columns = [col for col, p in pvalue_dict.items() if p <= 0.3]

dataForReg = data[selected_columns]

(xiv) Balancing Data

In [48]:
data['isReadmitted'] = data['readmitted'].apply(lambda x: x in ['<30', '>30'])
counts = data['isReadmitted'].value_counts()

count_true = data['isReadmitted'].sum()
count_false = len(data) - count_true

data_false = data[data['isReadmitted'] == False].sample(n=count_true, random_state=42)
data_true = data[data['isReadmitted'] == True]

data = pd.concat([data_true, data_false])

data = data.sample(frac=1, random_state=42)
counts = data['isReadmitted'].value_counts()
print("Répartition après : ",counts)
Répartition après :  isReadmitted
True     21813
False    21813
Name: count, dtype: int64

(xv) End of Preprocessing

So now we got 28 columns(versus 50 before) and 63002 rows (101766 before) We have three dataframes: Data, Data_cleaned (with categorical tabulation), and Data_encoded (only encoded data)

In [49]:
data.head()
Out[49]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses metformin repaglinide glimepiride glipizide glyburide pioglitazone rosiglitazone insulin change diabetesMed readmitted health isReadmitted
62983 Caucasian Male 85 Emergency Referral Emergency 2 56 1 17 0 0 0 Respiratory Respiratory Circulatory 9 No No No No No No No No No Yes >30 -0.201808 True
10269 Caucasian Female 45 Emergency Referral Emergency 5 68 0 13 0 0 3 Digestive Other Other 6 No No No No No No No Steady No Yes >30 3.905023 True
35717 Caucasian Female 85 Emergency Other Emergency 1 26 0 9 0 1 1 Injury Injury Other 9 No No No No No No No No No No >30 0.009277 True
3992 Caucasian Male 95 Emergency Other Emergency 3 36 0 9 0 0 0 Other Respiratory Other 5 No No No No No No No Steady No Yes NO -4.513091 False
22678 AfricanAmerican Male 75 Emergency Other Emergency 10 66 1 27 0 0 1 Circulatory Circulatory Respiratory 6 No No No No No No No No No No NO 4.306637 False


3. Distribution Plots


In [50]:
fig, axes = plt.subplots(3, 3, figsize=(20, 24))  # Adjust the size as needed

# Unified color palette
custom_palette = sns.color_palette("pastel", n_colors=10)  # Adjust 'n_colors' as needed

# Plot 1: Age distribution
sns.countplot(x="age", data=data, palette=custom_palette, ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution')
axes[0, 0].tick_params(axis='x', rotation=90)

# Plot 2: Race distribution
sns.countplot(x="race", data=data, palette=custom_palette, ax=axes[0, 1])
axes[0, 1].set_title('Race Distribution')
axes[0, 1].tick_params(axis='x', rotation=90)

# Plot 3: Gender distribution
sns.countplot(x="gender", data=data, palette=custom_palette, ax=axes[0, 2])
axes[0, 2].set_title('Gender Distribution')
axes[0, 2].tick_params(axis='x', rotation=90)

# Plot 4: Admission type distribution
sns.countplot(x="admission_type_id", data=data, palette=custom_palette, ax=axes[1, 0])
axes[1, 0].set_title('Admission Type Distribution')
axes[1, 0].tick_params(axis='x', rotation=90)

# Plot 5: Admission source distribution
sns.countplot(x="admission_source_id", data=data_cleaned, palette=custom_palette, ax=axes[1, 1])
axes[1, 1].set_title('Admission Source Distribution')
axes[1, 1].tick_params(axis='x', rotation=90)

# Plot 6: Discharge disposition distribution
sns.countplot(x="discharge_disposition_id", data=data_cleaned, palette=custom_palette, ax=axes[1, 2])
axes[1, 2].set_title('Discharge Disposition Distribution')
axes[1, 2].tick_params(axis='x', rotation=90)

# Plot 7: Time in hospital distribution
sns.countplot(x="time_in_hospital", data=data_cleaned, palette=custom_palette, ax=axes[2, 0])
axes[2, 0].set_title('Time in Hospital Distribution')
axes[2, 0].tick_params(axis='x', rotation=90)

# Plot 8: Readmission distribution
sns.countplot(x="readmitted", data=data_cleaned, palette=custom_palette, ax=axes[2, 1])
axes[2, 1].set_title('Readmission Distribution')
axes[2, 1].tick_params(axis='x', rotation=90)


# Hide the empty subplot
axes[2, 2].axis('off')


# Adjust layout
plt.tight_layout()
plt.show()

Analysis

After conducting ananalysis of the provided graphs, several key findings emerge. Firstly, the data reveals a significant prevalence of Caucasian seniors within the patient population. Specifically, the majority of individuals seeking medical care fall within the age range of 55 to 85 years old.

In terms of gender distribution, no definitive conclusions can be drawn as the number of cases is nearly evenly split between males and females. Regarding hospital admissions, it is evident that a majority of patients are admitted due to emergency circumstances. Additionally, their discharge disposition typically occurs through a referral process. Notably, the data indicates a decreasing trend in the length of hospital stays, with the majority of patients remaining hospitalized for a duration of 1 to 5 days.

The final graph focuses on patient readmissions. The data demonstrates a relatively equal distribution between patients who are readmitted and those who are not. However, what varies is the number of days elapsed before readmission takes place. Overall, the parameters studied in this analysis do not reveal any distinct characteristics specific to diabetic patients. Nevertheless, a significant proportion of these individuals experience readmission to the hospital. Therefore, it would be valuable to explore potential correlations between these parameters and the readmission status of diabetic patients.

In [51]:
fig, axes = plt.subplots(4, 2, figsize=(14, 12))

# Plot for race
palette = ["#FFC0CB", "#6495ED","#FFFF00"]
sns.countplot(x="race", hue="readmitted", data=data, palette=palette, ax=axes[0, 0])
axes[0, 0].set_title('Readmission according to the race')

# Plot for age
sns.countplot(x="age", hue="readmitted", data=data, palette=palette, ax=axes[0, 1])
axes[0, 1].set_title('Readmission according to the age')

# Plot for gender
sns.countplot(x="gender", hue="readmitted", data=data, palette=palette, ax=axes[1, 0])
axes[1, 0].set_title('Readmission according to the gender')

# Plot for time spent in hospital
sns.countplot(x="time_in_hospital", hue="readmitted", data=data, palette=palette, ax=axes[1, 1])
axes[1, 1].set_title('Readmission according to the time spent in the hospital')

#Plot for admission type id
sns.countplot(x="admission_type_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 0])
axes[2, 0].set_title('Readmission according to the admission type id')

#Plot for admission type source 
sns.countplot(x="admission_source_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 1])
axes[2, 1].set_title('Readmission according to the admission source id')

#Plot for discharge disposition id
sns.countplot(x="discharge_disposition_id", hue="readmitted", data=data, palette=palette, ax=axes[3, 0])
axes[3, 0].set_title('Readmission according to the discharge disposition id')

#Plot for number of diagnoses
sns.countplot(x="number_diagnoses", hue="readmitted", data=data, palette=palette, ax=axes[3, 1])
axes[3, 1].set_title('Readmission according to the number of medications')

plt.tight_layout()
plt.show()

Analysis

Due to the imbalanced distribution of classes in the dataset, with the "NO" class constituting approximately 50% of the data and the other classes having significantly fewer labels, a class imbalance problem arises. To mitigate this issue, we will reframe the problem as a two-class classification task. Specifically, our objective will be to predict whether a patient will be readmitted or not, without considering the distinction between readmissions within less than 30 days or greater than 30 days.

By simplifying the problem in this manner, we can effectively address the class imbalance problem and focus our predictive efforts on the primary task of determining readmission likelihood. This approach enables us to allocate appropriate attention and resources to accurately predicting the patient's readmission status

In [52]:
#Readmission status according to the race,age, gender and time spent in hospital
fig, axes = plt.subplots(4, 2, figsize=(18, 16))

# Plot for race
palette = ["#FFC0CB", "#6495ED"]
sns.countplot(x="race", hue="readmitted", data=data, palette=palette, ax=axes[0, 0])
axes[0, 0].set_title('Readmission according to the race')

# Plot for age
sns.countplot(x="age", hue="readmitted", data=data, palette=palette, ax=axes[0, 1])
axes[0, 1].set_title('Readmission according to the age')

# Plot for gender
sns.countplot(x="gender", hue="readmitted", data=data, palette=palette, ax=axes[1, 0])
axes[1, 0].set_title('Readmission according to the gender')

# Plot for time spent in hospital
sns.countplot(x="time_in_hospital", hue="readmitted", data=data, palette=palette, ax=axes[1, 1])
axes[1, 1].set_title('Readmission according to the time spent in the hospital')

# Plot for admission type id
sns.countplot(x="admission_type_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 0])
axes[2, 0].set_title('Readmission according to the admission type id')

# Plot for admission source id
sns.countplot(x="admission_source_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 1])
axes[2, 1].set_title('Readmission according to the admission source id')

# Plot for discharge disposition id
sns.countplot(x="discharge_disposition_id", hue="readmitted", data=data, palette=palette, ax=axes[3, 0])
axes[3, 0].set_title('Readmission according to the discharge disposition id')

# Plot for number of diagnoses
sns.countplot(x="number_diagnoses", hue="readmitted", data=data, palette=palette, ax=axes[3, 1])
axes[3, 1].set_title('Readmission according to the number of diagnoses')

plt.tight_layout()
plt.show()

Analysis

Upon analyzing the various graphs depicting the relationship between readmission and different parameters such as race, age, gender, and time spent in the hospital, a notable observation emerges. Overall, these graphs reveal a lack of discernible distinctions between individuals who were readmitted and those who were not. This indicates that the factors being examined do not exhibit a significant impact on the likelihood of readmission.

For instance, when transitioning from analyzing the influence of race to age, it becomes apparent that both readmitted and not readmitted individuals display similar trends. There is no evident divergence in the patterns followed by each status class, suggesting that age does not play a substantial role in readmission likelihood.

Similarly, when exploring the relationship between gender and readmission, the graphs demonstrate comparable trajectories for both status classes. This similarity implies that gender does not serve as a distinguishing factor in predicting readmission probability.

Furthermore, examining the impact of the time spent in the hospital on readmission reveals a lack of substantial differentiation between the two status classes. The trends observed for both readmitted and not readmitted individuals follow nearly identical paths, indicating that the duration of hospitalization does not significantly affect the likelihood of readmission.

Overall, these findings suggest that the parameters investigated, including race, age, gender, and time spent in the hospital, do not exhibit clear distinctions between individuals who are readmitted and those who are not. This indicates the need for further exploration and consideration of additional factors that may contribute to the prediction of readmission likelihood in order to gain a more comprehensive understanding of the underlying factors at play.

In [53]:
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Plot configurations
plot_configs = [
    {'x': 'num_lab_procedures', 'title': 'Relationship of Lab Procedures with Readmission'},
    {'x': 'time_in_hospital', 'title': 'Relationship of Time in Hospital with Readmission'},
    {'x': 'num_medications', 'title': 'Relationship of Medications with Readmission'},
    {'x': 'num_procedures', 'title': 'Relationship of Procedures with Readmission'},
    {'x': 'number_diagnoses', 'title': 'Relationship of Diagnoses with Readmission'}
]

for i, config in enumerate(plot_configs):
    row, col = divmod(i, 3)
    sns.kdeplot(data=data, x=config['x'], hue='readmitted', ax=axes[row, col])
    axes[row, col].set_title(config['title'])

axes[1, 2].axis('off')
plt.tight_layout()
plt.show()

Analysis

The three graphs depicting the correlations between lab procedures, hospital stay duration, and medication usage with the readmission status provide intriguing insights. Despite following a similar overall trend regardless of the readmission status, there are distinct patterns for each status category. Notably, it is evident that the "Not readmitted" status consistently exhibits higher values compared to the "Readmitted" status. Examining the relationship between lab procedures and readmission, we observe that both status categories generally increase and decrease simultaneously. However, patients who receive approximately 50 lab procedures are notably less likely to be readmitted. Similarly, a shorter initial hospital stay of 1 to 5 days is associated with a reduced likelihood of readmission. Furthermore, patients who are prescribed around 15 medications also show a decreased likelihood of readmission. Analyzing the number of procedures performed, we find that both status categories exhibit a similar trend, although the paths diverge more as the number of procedures increases. Notably, there appears to be a lower readmission rate when patients undergo 0 to 3 procedures. Additionally, exploring the correlation between the number of diagnoses and readmission status reveals minimal variance. However, a slight distinction can be observed when a patient receives more than 5 diagnoses, as they exhibit a lower likelihood of readmission. Overall, these findings highlight the potential influence of lab procedures, hospital stay duration, medication usage, and the number of diagnoses on readmission rates.

In [54]:
colors = ["#FFC0CB", "#FF69B4", "#FF1493"]
data["readmitted"].value_counts().plot.pie(autopct="%.1f%%", colors=colors)
plt.title("Readmission repartitions")
plt.show()

The bar plot shows the distribution of readmission statuses in the dataset. The readmission status is categorized into three groups:

NO: The patient was not readmitted.

more than 30: The patient was readmitted after more than 30 days.

less than 30: The patient was readmitted within 30 days.

In [55]:
diabetes_data=data
sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot for 'age'
sns.histplot(diabetes_data['age'], kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age')

# Plot for 'time_in_hospital'
sns.histplot(diabetes_data['time_in_hospital'], kde=True, ax=axes[1])
axes[1].set_title('Distribution of Time in Hospital')

# Plot for 'num_medications'
sns.histplot(diabetes_data['num_medications'], kde=True, ax=axes[2])
axes[2].set_title('Distribution of Number of Medications')

plt.tight_layout()
plt.show()
In [114]:
corr_matrix = data_encoded.corr()

plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, cmap='coolwarm')

plt.title('Heatmap of Encoded Data Correlations')
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.show()
In [56]:
NumMedicationsAverage = data_cleaned.groupby("race")["num_medications"].mean()
raceGender = data_cleaned.groupby("race")["race"].count()
TimeInHospital = data_cleaned.groupby("race")["time_in_hospital"].mean()
NumMedicationsAverage = pd.DataFrame({
    "Race" : list(NumMedicationsAverage.index),
    "NumMedications" : list(NumMedicationsAverage),
    "raceGender" : list(raceGender),
    "TimeInHospital" : list(TimeInHospital)})

fig = px.scatter(NumMedicationsAverage, x='NumMedications', y='TimeInHospital', color='raceGender', size='raceGender',
                 height=500, text='Race', log_x=True, log_y=True, 
                 color_discrete_sequence=px.colors.qualitative.Vivid,
                 template='plotly_dark')
fig.update_traces(marker_size=30, marker_line_width=1, opacity=0.7)
fig.update_layout(plot_bgcolor = 'rgb(50,50,50)',
                  paper_bgcolor = 'rgb(50,50,50)',
                  font=dict(family="Arial, sans-serif",
                            size=15,
                            color="#FFFFFF"))
fig.show()
In [57]:
fig = px.scatter(data, x="time_in_hospital", y="age", size="time_in_hospital",
                 color="age", hover_name="time_in_hospital", size_max=30,
                 title='Gender Distribution by time_in_hospital',
                 color_discrete_map={'Male': 'blue', 'Female': 'pink'},
                 template='plotly_dark', width=500, height= 400)

fig.update_layout(plot_bgcolor = 'rgb(50,50,50)',
                  paper_bgcolor = 'rgb(50,50,50)',
                  font=dict(family="Arial, sans-serif",
                            size=15,
                            color="#FFFFFF"))

fig.show()
In [58]:
race_top = data['race'].value_counts()[:]

race_top = race_top.sort_values(ascending=False)

print("The top races in the dataset are:\n")
print("{:2s} {:20s} {:s}".format("#", "Race", "Count"))
for i, ranking in enumerate(race_top.index):
    print("{:2d} {:20s} {:d}".format(i+1, ranking, int(race_top[ranking])))

rcParams['figure.facecolor'] = '#343131'
plt.style.use('dark_background')

#creating a 3D bar plot
fig1 = plt.figure(figsize=(15,15))
ax = plt.axes(projection='3d')
ax.bar3d(range(len(race_top)), range(len(race_top)), np.zeros(len(race_top)), 1, 1, race_top.values, shade=True, color=sns.color_palette("RdBu", n_colors=len(race_top.values)))
ax.set_xticks(range(len(race_top)))
ax.set_xticklabels(race_top.index)
ax.set_title("Top Races in the Dataset")
ax.set_xlabel("Race")
ax.set_ylabel("Ranking")
ax.set_zlabel("Count")
plt.show()
The top races in the dataset are:

#  Race                 Count
 1 Caucasian            33164
 2 AfricanAmerican      8575
 3 Hispanic             860
 4 Other                720
 5 Asian                307
In [59]:
readmitted_data = data[data['isReadmitted'] == True]

pivot_table = pd.pivot_table(data, values='isReadmitted', index=['age'], columns=['race'], aggfunc='count', fill_value=0)
In [60]:
import plotly.express as px

def plot_treemap(data):
    if 'time_in_hospital' in data.columns and 'race' in data.columns:
        data_for_treemap = data.groupby(["time_in_hospital", "race"]).size().reset_index(name='counts')

        fig = px.treemap(data_for_treemap, path=["time_in_hospital", "race"], values='counts',
                         title="Number of times spent by each race", 
                         color_discrete_sequence=px.colors.qualitative.Dark2)
        fig.data[0].textinfo = 'label+text'
        fig.show()
    else:
        print("Required columns are not in the DataFrame")

plot_treemap(data)


4. Studying Machine Learning models


Thus this far, we have fathomed what machine learning models we could apply to our dataset: RandomForest, DecisionTree, Linear Regression and DecisionTree classifiers. We will try to predict our values and evaluate our models. Then, we will analyze our results.

a. Evaluating Models: Linear Regression, RandomForest and Decision Tree Classifiers

Classifiers Models

  • Here are the reasons on why we should use these algorithms for our classification problem.
  • Linear Regression

    • Simplicity and Interpretability: It models the relationship between a dependent variable and one or more independent variables.
    • Baseline Model: Can be used as a starting point for regression problems to establish a baseline performance.

    Random Forest

    • Non-Linearity: Random Forest can handle non-linear data effectively. It combines multiple decision trees to improve the model's accuracy and prevent overfitting.
    • Feature Importance: Provides insights into which features are most important in making predictions.
    • Versatility: Works well for both classification and regression tasks and can handle large datasets with higher dimensionality.

    Decision Tree

    • Easy to Interpret: Decision Trees can be visualized easily, making them highly interpretable.
    • Non-Parametric: Does not require much data preprocessing, such as normalization or scaling. They can handle both numerical and categorical data.
    In [95]:
    def evaluate_model(X, y, classifier='RF', random_state=42):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    
        if classifier == 'RF':
            model = RandomForestClassifier(n_estimators=100, random_state=random_state)
        elif classifier == 'LR':
            model = LogisticRegression(random_state=random_state)
        elif classifier == 'DT':
            model = DecisionTreeClassifier(random_state=random_state)
        else:
            raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression, or 'DT' for DecisionTree.")
            
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', model)
        ])
    
        pipeline.fit(X_train, y_train)
    
    
        y_pred = pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        
        print(f"Model: {classifier}")
        print(f"Accuracy: {accuracy}")
        print(f"Classification Report:\n{report}")
        
        #Confusion Matrix
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=y_test.unique(), yticklabels=y_test.unique())
        plt.xlabel('Predicted')
        plt.ylabel('True')
        plt.title(f'Confusion Matrix for {classifier}')
        plt.show()
        
        return y_pred, accuracy, report 
    
    In [96]:
    X = dataForClassification.drop('readmitted', axis=1)
    X = pd.get_dummies(X)
    y = data_encoded['readmitted']
    
    In [97]:
    RF_y_pred, RF_accuracy, RF_report = evaluate_model(X, y, classifier='RF')
    
    Model: RF
    Accuracy: 0.6226757369614513
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.65      0.82      0.73      6742
               1       0.52      0.31      0.39      4283
    
        accuracy                           0.62     11025
       macro avg       0.59      0.57      0.56     11025
    weighted avg       0.60      0.62      0.60     11025
    
    
    In [98]:
    LR_y_pred, LR_accuracy, LR_report = evaluate_model(X, y, classifier='LR')
    
    Model: LR
    Accuracy: 0.6346485260770975
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.64      0.91      0.75      6742
               1       0.59      0.20      0.30      4283
    
        accuracy                           0.63     11025
       macro avg       0.61      0.56      0.53     11025
    weighted avg       0.62      0.63      0.58     11025
    
    
    In [99]:
    DT_y_pred, DT_accuracy, DT_report = evaluate_model(X, y, classifier='DT')
    
    Model: DT
    Accuracy: 0.5438548752834467
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.63      0.61      0.62      6742
               1       0.42      0.44      0.43      4283
    
        accuracy                           0.54     11025
       macro avg       0.52      0.52      0.52     11025
    weighted avg       0.55      0.54      0.55     11025
    
    
    In [100]:
    accuracies = {'LR': LR_accuracy, 'RF': RF_accuracy, 'DT': DT_accuracy}
    
    plt.bar(accuracies.keys(), accuracies.values())
    plt.xlabel('Classifier')
    plt.ylabel('Accuracy')
    plt.title('Comparison of Classifier Accuracies')
    plt.show()
    

    Observations :

    Our first results show us promising results in bout Linear Regression and Random Forest Classifiers, but Decision Tree seems to lack a few parameters to be completely operational.

    b. Feature Importances and Coefficients

    In [96]:
    def evaluate_model(X, y, feature_names, classifier='RF', random_state=42):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    
        if classifier == 'RF':
            model = RandomForestClassifier(n_estimators=100, random_state=random_state)
        elif classifier == 'LR':
            model = LogisticRegression(random_state=random_state)
        elif classifier == 'DT':
            model = DecisionTreeClassifier(random_state=random_state)
        else:
            raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression, or 'DT' for DecisionTree.")
            
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', model)
        ])
    
        pipeline.fit(X_train, y_train)
    
        if classifier in ['RF', 'DT']:
            feature_importances = pipeline.named_steps['classifier'].feature_importances_
    
            #Combines feature names and their importances, then sort them
            feature_importance_pairs = sorted(zip(feature_names, feature_importances), key=lambda x: x[1], reverse=True)
            print(f"Model: {classifier}")
            print("Feature Importances (Descending):")
            for name, importance in feature_importance_pairs:
                print(f"{name}: {importance}")
    
        if classifier == 'LR':
            #Retrieves the model's coefficients
            coefficients = pipeline.named_steps['classifier'].coef_[0]
    
            #Combine feature names and their coefficients, then sort them
            feature_coefficient_pairs = sorted(zip(feature_names, coefficients), key=lambda x: x[1], reverse=True)
    
            print("Feature Coefficients (Descending):")
            for name, coef in feature_coefficient_pairs:
                print(f"{name}: {coef}")
    
    In [66]:
    evaluate_model(X, y, X.columns, classifier='RF')
    
    Model: RF
    Feature Importances (Descending):
    health: 0.13944856242343415
    num_lab_procedures: 0.12228688866277172
    num_medications: 0.10581542615079932
    time_in_hospital: 0.07607638756092942
    age: 0.06476983200638976
    num_procedures: 0.050125380473973837
    number_diagnoses: 0.048963308300453194
    number_inpatient: 0.01900303140089058
    diag_3_Other: 0.0175999340275441
    number_outpatient: 0.017497994117030475
    diag_3_Circulatory: 0.016812376810658645
    gender_Female: 0.015891160969829156
    gender_Male: 0.015887524156369803
    diag_1_Circulatory: 0.015615727735958186
    diag_1_Other: 0.014790355277338693
    discharge_disposition_id_Referral: 0.013476770403995534
    diag_1_Respiratory: 0.012344556864803631
    discharge_disposition_id_Other: 0.012017967646439011
    number_emergency: 0.011275603160365102
    diag_3_Diabetes: 0.010881445191778277
    metformin_No: 0.010609021098365142
    metformin_Steady: 0.010418074603075502
    diag_1_Digestive: 0.01039304990052914
    change_No: 0.01023476906144564
    change_Ch: 0.010176811371080795
    diag_3_Respiratory: 0.009683419826684869
    glipizide_No: 0.009610437409541383
    admission_source_id_Emergency: 0.009467386314602906
    glipizide_Steady: 0.009075620946668002
    diag_3_Genitourinary: 0.008733477646221935
    admission_source_id_Referral: 0.008732420472440994
    diag_1_Genitourinary: 0.007982419535780922
    diag_1_Injury: 0.007741673444785552
    diag_3_Digestive: 0.006875896594587546
    admission_type_id_Emergency: 0.006184451544552506
    admission_type_id_Elective: 0.0061195951930300705
    diag_1_Musculoskeletal: 0.0061130051725721835
    admission_source_id_Other: 0.005654659723370894
    diag_1_Neoplasms: 0.005591065020083724
    diabetesMed_No: 0.005311876083012215
    discharge_disposition_id_18: 0.00527382110797224
    diabetesMed_Yes: 0.004746549282060857
    diag_3_Injury: 0.004214963502264784
    diag_3_Musculoskeletal: 0.004036219778009884
    diag_3_Neoplasms: 0.003992869458450163
    metformin_Up: 0.0022489258331206546
    glipizide_Up: 0.00174328322228958
    metformin_Down: 0.0014938681583061594
    glipizide_Down: 0.0014361661574670032
    discharge_disposition_id_Emergency: 0.001361500412408972
    diag_1_Pregnancy: 0.001269779080681933
    discharge_disposition_id_23: 0.0010246763217487066
    diag_3_Pregnancy: 0.0006030024021905921
    diag_1_Diabetes: 0.0005388805403514824
    discharge_disposition_id_28: 0.000454583668745858
    discharge_disposition_id_24: 0.00014173728398512054
    admission_type_id_Trauma Center: 4.806354748208742e-05
    admission_type_id_New Born: 4.2629402584261597e-05
    discharge_disposition_id_12: 2.046089439525105e-05
    discharge_disposition_id_19: 9.664268961823985e-06
    discharge_disposition_id_27: 8.99137233821077e-06
    
    In [67]:
    evaluate_model(X, y, X.columns, classifier='LR')
    
    Feature Coefficients (Descending):
    number_inpatient: 0.21483572528857936
    number_emergency: 0.11624915456463253
    health: 0.10624136369348006
    number_diagnoses: 0.09995854373120733
    number_outpatient: 0.07623847524146755
    time_in_hospital: 0.0756953119070123
    age: 0.0650001337356093
    diag_1_Circulatory: 0.05569944443215995
    diabetesMed_Yes: 0.05475255536523958
    admission_source_id_Emergency: 0.04014562089172916
    discharge_disposition_id_28: 0.03618420653962188
    admission_source_id_Referral: 0.03313911504147939
    glipizide_Down: 0.026109078514487485
    admission_type_id_Emergency: 0.02554799568465501
    metformin_No: 0.0248120518516215
    discharge_disposition_id_Referral: 0.023224081558972456
    diag_3_Circulatory: 0.019823182469224394
    gender_Female: 0.014651993539839554
    diag_1_Other: 0.01260389392390893
    glipizide_Steady: 0.010794480359940064
    change_Ch: 0.010693403115696795
    diag_3_Digestive: 0.00971100295859231
    diag_3_Other: 0.0065007211765714
    diag_3_Genitourinary: 0.0064842793936444355
    diag_1_Respiratory: 0.005047609642537294
    diag_3_Respiratory: 0.0037210779176871534
    discharge_disposition_id_12: 0.0031687868648649347
    admission_type_id_New Born: 0.002033035728410787
    diag_3_Pregnancy: 0.0017442212561520644
    discharge_disposition_id_Emergency: 0.001161316349126094
    metformin_Down: -0.0005273443441462288
    diag_3_Musculoskeletal: -0.0034489881597681263
    diag_1_Musculoskeletal: -0.005818550921573946
    diag_1_Digestive: -0.00861318581470544
    glipizide_Up: -0.008983378472124583
    change_No: -0.010693403115693592
    discharge_disposition_id_Other: -0.011470646342594669
    glipizide_No: -0.013665325424637952
    discharge_disposition_id_18: -0.013810725093261322
    gender_Male: -0.014651993539840112
    diag_1_Injury: -0.017785114944291802
    metformin_Up: -0.019213894245900864
    admission_type_id_Elective: -0.019477319079029262
    metformin_Steady: -0.02030208673628146
    diag_1_Genitourinary: -0.021486380330947925
    diag_3_Neoplasms: -0.02213709118333147
    diag_3_Injury: -0.024651415694954746
    diag_3_Diabetes: -0.028719926932409393
    discharge_disposition_id_24: -0.02874360803376186
    diag_1_Diabetes: -0.030959807752258678
    num_medications: -0.032462888224795206
    discharge_disposition_id_23: -0.03814133286328189
    diabetesMed_No: -0.05475255536523958
    discharge_disposition_id_27: -0.05682576758164103
    diag_1_Neoplasms: -0.06396493176913116
    discharge_disposition_id_19: -0.06643407995714638
    diag_1_Pregnancy: -0.06652209842799275
    num_lab_procedures: -0.07367467980460696
    num_procedures: -0.07388248029631934
    admission_source_id_Other: -0.13792650889343377
    admission_type_id_Trauma Center: -0.141294739279956
    
    In [97]:
    evaluate_model(X, y, X.columns, classifier='DT')
    
    Model: DT
    Feature Importances (Descending):
    health: 0.19467809029367722
    num_lab_procedures: 0.14356510764026692
    num_medications: 0.10757291867109377
    time_in_hospital: 0.07109384352313515
    age: 0.051177185659353405
    num_procedures: 0.03711096598595999
    number_diagnoses: 0.0361110500390211
    diag_3_Circulatory: 0.019394766284820016
    number_inpatient: 0.018940487656694815
    diag_3_Other: 0.01594629938262362
    diag_1_Circulatory: 0.015016276662505745
    gender_Female: 0.01410552565672047
    diag_1_Other: 0.013513260418592968
    number_outpatient: 0.012895451789864495
    gender_Male: 0.011282824196652353
    diag_3_Genitourinary: 0.010712851402882029
    diag_3_Respiratory: 0.010677553312502667
    discharge_disposition_id_Referral: 0.010601807167048175
    diag_1_Digestive: 0.010392105961269457
    discharge_disposition_id_Other: 0.010124578247870207
    diag_3_Diabetes: 0.009716379959136805
    change_No: 0.009533271775835411
    diag_1_Respiratory: 0.009522835380924305
    glipizide_Steady: 0.009319130748747248
    metformin_No: 0.008588199335121413
    change_Ch: 0.008467884488902804
    diag_1_Genitourinary: 0.008317829155426674
    metformin_Steady: 0.008252965082819506
    diag_1_Injury: 0.008215022420421984
    admission_source_id_Emergency: 0.008192014668104063
    admission_source_id_Referral: 0.008125429016038393
    diag_3_Digestive: 0.007813974960041029
    admission_type_id_Emergency: 0.007570498039228126
    number_emergency: 0.007459523154536431
    glipizide_No: 0.007067071643168371
    diag_1_Neoplasms: 0.006146822925012325
    diabetesMed_No: 0.0059760477097499706
    discharge_disposition_id_18: 0.005383320503903086
    admission_source_id_Other: 0.005333626248607413
    diag_1_Musculoskeletal: 0.005194542982799461
    admission_type_id_Elective: 0.005005373834801787
    diag_3_Musculoskeletal: 0.004432912609087211
    diag_3_Neoplasms: 0.003392786963379391
    diag_3_Injury: 0.0031099358183306054
    diabetesMed_Yes: 0.0029298760546444436
    metformin_Down: 0.002443328880450783
    metformin_Up: 0.001989402068666195
    diag_1_Pregnancy: 0.0014134405889158308
    glipizide_Down: 0.0013646007605203534
    discharge_disposition_id_Emergency: 0.0012711365707662628
    glipizide_Up: 0.0012430302760499156
    discharge_disposition_id_23: 0.0007838759410484252
    diag_3_Pregnancy: 0.0006233242716186621
    diag_1_Diabetes: 0.00046025946997530864
    discharge_disposition_id_28: 0.00025393882123982055
    discharge_disposition_id_24: 0.00010771098715241646
    admission_type_id_Trauma Center: 6.57259322732852e-05
    admission_type_id_New Born: 0.0
    discharge_disposition_id_12: 0.0
    discharge_disposition_id_19: 0.0
    discharge_disposition_id_27: 0.0
    

    Observations :

    We can now get a better grasp of the core components of our analysises, we realise our created variable 'health' has one of the highest scores.

    c. Grid Search: Finding the best hyperparameters

    In [68]:
    def evaluate_model_with_gridsearch(X, y, classifier='RF', random_state=42):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    
        if classifier == 'RF':
            model = RandomForestClassifier(random_state=random_state)
            param_grid = {
                'classifier__n_estimators': [700], #Values tested: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000
                'classifier__max_depth': [10,20] #Values tested: 0.01, 0.1, 1, 5, 10, 15, 20, 40, 50, 100
            }
        elif classifier == 'LR':
            model = LogisticRegression(random_state=random_state)
            param_grid = {
                'classifier__C': [100, 0.1, 5, 10],
                'classifier__penalty': ['l2', 'none'],
                'classifier__solver': ['lbfgs', 'liblinear']
            }
        elif classifier == 'DT':
            model = DecisionTreeClassifier(random_state=random_state)
            param_grid = {
            'classifier__max_depth': [10, 20, 30, None],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4],
            'classifier__criterion': ['gini', 'entropy']
            }
        else:
            raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression or 'DT' for DecisionTree Classifier.")
    
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', model)
        ])
    
        grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
        grid_search.fit(X_train, y_train)
    
        print(f"Best Parameters: {grid_search.best_params_}")
        print(f"Best CV Score: {grid_search.best_score_}")
    
        y_pred = grid_search.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        
        print(f"Model: {classifier}")
        print(f"Accuracy on Test Set: {accuracy}")
        print(f"Classification Report on Test Set:\n{report}")
    
    In [69]:
    evaluate_model_with_gridsearch(X, y, classifier='RF') #20 and 700
    
    Best Parameters: {'classifier__max_depth': 20, 'classifier__n_estimators': 700}
    Best CV Score: 0.6270748299319727
    Model: RF
    Accuracy on Test Set: 0.6351927437641723
    Classification Report on Test Set:
                  precision    recall  f1-score   support
    
               0       0.65      0.88      0.75      6742
               1       0.57      0.25      0.35      4283
    
        accuracy                           0.64     11025
       macro avg       0.61      0.56      0.55     11025
    weighted avg       0.62      0.64      0.59     11025
    
    
    In [70]:
    evaluate_model_with_gridsearch(X, y, classifier='LR')
    
    Best Parameters: {'classifier__C': 100, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
    Best CV Score: 0.6244671201814059
    Model: LR
    Accuracy on Test Set: 0.6346485260770975
    Classification Report on Test Set:
                  precision    recall  f1-score   support
    
               0       0.64      0.91      0.75      6742
               1       0.59      0.20      0.30      4283
    
        accuracy                           0.63     11025
       macro avg       0.61      0.56      0.53     11025
    weighted avg       0.62      0.63      0.58     11025
    
    
    In [71]:
    evaluate_model_with_gridsearch(X, y, classifier='DT')
    
    Best Parameters: {'classifier__criterion': 'entropy', 'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5}
    Best CV Score: 0.617891156462585
    Model: DT
    Accuracy on Test Set: 0.6234920634920635
    Classification Report on Test Set:
                  precision    recall  f1-score   support
    
               0       0.65      0.85      0.73      6742
               1       0.53      0.26      0.35      4283
    
        accuracy                           0.62     11025
       macro avg       0.59      0.56      0.54     11025
    weighted avg       0.60      0.62      0.59     11025
    
    

    Observations :

    We can now set our best parameters for our models, so that now we can re-evaluate them.

    d. Re-evaluating the model with the best parameters

    In [103]:
    def evaluate_model(X, y, feature_names, classifier='RF', random_state=42):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    
        if classifier == 'RF':
            model = RandomForestClassifier(max_depth=20, n_estimators=700, random_state=random_state)
        elif classifier == 'LR':
            model = LogisticRegression(penalty = 'l2',solver = 'lbfgs', random_state=random_state)
        elif classifier == 'DT':
            model = DecisionTreeClassifier(criterion='entropy',max_depth=10,min_samples_leaf=2,random_state=random_state)
            #criterion': 'entropy', 'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 
            #'classifier__min_samples_split': 5}
        else:
            raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest or 'LR' for LogisticRegression.")
    
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', model)
        ])
    
        pipeline.fit(X_train, y_train)
    
        y_pred = pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        report = classification_report(y_test, y_pred)
        
        print(f"Model: {classifier}")
        print(f"Accuracy: {accuracy}")
        print(f"Classification Report:\n{report}")
        
        # Plotting Confusion Matrix
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=y_test.unique(), yticklabels=y_test.unique())
        plt.xlabel('Predicted')
        plt.ylabel('True')
        plt.title(f'Confusion Matrix for {classifier}')
        plt.show()
        
        return y_pred, accuracy, report 
    
    In [104]:
    X = dataForClassification.drop('readmitted', axis=1)
    X = pd.get_dummies(X)
    y = data_encoded['readmitted']
    
    In [106]:
    RF_y_pred, RF_accuracy, RF_report = evaluate_model(X, y, X.columns, classifier='RF')
    
    Model: RF
    Accuracy: 0.6351927437641723
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.65      0.88      0.75      6742
               1       0.57      0.25      0.35      4283
    
        accuracy                           0.64     11025
       macro avg       0.61      0.56      0.55     11025
    weighted avg       0.62      0.64      0.59     11025
    
    
    In [107]:
    LR_y_pred, LR_accuracy, LR_report = evaluate_model(X, y, X.columns, classifier='LR')
    
    Model: LR
    Accuracy: 0.6346485260770975
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.64      0.91      0.75      6742
               1       0.59      0.20      0.30      4283
    
        accuracy                           0.63     11025
       macro avg       0.61      0.56      0.53     11025
    weighted avg       0.62      0.63      0.58     11025
    
    
    In [78]:
    from sklearn.tree import DecisionTreeClassifier
    
    In [108]:
    DT_y_pred, DT_accuracy, DT_report = evaluate_model(X, y, X.columns, classifier='DT')
    
    Model: DT
    Accuracy: 0.6236734693877551
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.65      0.85      0.73      6742
               1       0.53      0.26      0.35      4283
    
        accuracy                           0.62     11025
       macro avg       0.59      0.56      0.54     11025
    weighted avg       0.60      0.62      0.59     11025
    
    
    In [109]:
    accuracies = {'LR': LR_accuracy, 'RF': RF_accuracy, 'DT': DT_accuracy}
    
    plt.bar(accuracies.keys(), accuracies.values())
    plt.xlabel('Classifier')
    plt.ylabel('Accuracy')
    plt.title('Comparison of Classifier Accuracies')
    plt.show()
    

    Observations :

    Because of Grid Search, We enhanced our performances on our three models. We can see that the three values converge around 0,64. We can then come close to the conclusion that our dataset can only allow us so much.

    e. With a smaller dataset?

    In [67]:
    dataForClassification.columns
    
    Out[67]:
    Index(['gender', 'age', 'admission_type_id', 'discharge_disposition_id',
           'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
           'num_procedures', 'num_medications', 'number_outpatient',
           'number_emergency', 'number_inpatient', 'diag_1', 'diag_3',
           'number_diagnoses', 'metformin', 'glipizide', 'change', 'diabetesMed',
           'readmitted', 'health'],
          dtype='object')
    In [68]:
    columns_to_drop = [
        'admission_source_id',
        'diag_3', 'diag_1', 'diabetesMed',
        'metformin', 'glipizide', 'metformin', 'glipizide',
        'discharge_disposition_id', 'admission_type_id'
    ]
    
    X = dataForClassification.drop(['readmitted'] + columns_to_drop, axis=1)
    X = pd.get_dummies(X)
    y = data_encoded['readmitted']
    
    In [69]:
    evaluate_model(X, y, X.columns, classifier='RF')
    
    Model: RF
    Accuracy: 0.617233560090703
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.65      0.83      0.73      6742
               1       0.51      0.28      0.37      4283
    
        accuracy                           0.62     11025
       macro avg       0.58      0.56      0.55     11025
    weighted avg       0.59      0.62      0.59     11025
    
    
    In [71]:
    evaluate_model(X, y, X.columns, classifier='DT')
    
    Model: DT
    Accuracy: 0.6222222222222222
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.64      0.86      0.74      6742
               1       0.53      0.25      0.34      4283
    
        accuracy                           0.62     11025
       macro avg       0.59      0.55      0.54     11025
    weighted avg       0.60      0.62      0.58     11025
    
    
    In [70]:
    evaluate_model(X, y, X.columns, classifier='LR')
    
    Model: LR
    Accuracy: 0.6296598639455783
    Classification Report:
                  precision    recall  f1-score   support
    
               0       0.63      0.93      0.75      6742
               1       0.59      0.16      0.25      4283
    
        accuracy                           0.63     11025
       macro avg       0.61      0.54      0.50     11025
    weighted avg       0.62      0.63      0.56     11025
    
    

    Observations :

    Removing columns with few influence did impact the model on a small scale, that makes us wonder if it were needed to scale up, to gain HC power we could simply remove them. On a smaller scale like here, we will keep them so that we can have the best prediction.


    5.Regression Study


    a. Evaluating Models for regression

    In [80]:
    dataForReg.head()
    
    Out[80]:
    gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses insulin change diabetesMed health
    1 Female 15 Emergency Referral Emergency 3 59 0 18 0 0 0 Other Other Other 9 Up Ch Yes -0.148877
    3 Male 35 Emergency Referral Emergency 2 44 1 16 0 0 0 Other Other Circulatory 7 Up Ch Yes -1.994875
    4 Male 45 Emergency Referral Emergency 1 51 0 8 0 0 0 Neoplasms Neoplasms Diabetes 5 Steady Ch Yes -4.544908
    5 Male 55 Emergency Referral Referral 3 31 6 16 0 0 0 Circulatory Circulatory Diabetes 9 Steady No Yes 1.529434
    7 Male 75 Emergency Referral Emergency 5 73 0 12 0 0 0 Circulatory Respiratory Diabetes 8 No No Yes -0.001356
    In [81]:
    dataForReg.shape
    
    Out[81]:
    (55125, 20)
    In [82]:
    X = dataForReg.drop("time_in_hospital", axis=1)
    X = X.drop("health",axis = 1)
    X = pd.get_dummies(X)
    y = dataForReg["time_in_hospital"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Standardize the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    

    Data explained

    Our approach involved a straightforward process: excluding health parameters. Health, in this context, is an aggregate of components considering the number of days spent in the hospital. The solution to our inquiry was inherent in the question itself. Additionally, we had to convert discrete variables into a binary format using dummy variables, ensuring compatibility with the sklearn library.

    In [83]:
    models = [
        ("Linear Regression", LinearRegression()),
        ("Ridge Regression", Ridge()),
        ("Lasso Regression", Lasso()),
        ("Decision Tree", DecisionTreeRegressor(random_state=42)),
        ("Random Forest", RandomForestRegressor(random_state=42)),
        ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),
    ]
    
    In [84]:
    model_performance = {}
    feature_importance_dict = {}
    
    for name, model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        # calcul de RMSE
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        
        model_performance[name] = rmse
        
        # to get the most importants variables it depends on the model
        if name in ["Linear Regression", "Ridge Regression", "Lasso Regression"]:
            importance = model.coef_
        else:
            importance = model.feature_importances_
        feature_importance = pd.DataFrame(importance, index=X_train.columns, columns=["Importance"])
        feature_importance = feature_importance.sort_values("Importance", ascending=False)
        feature_importance_dict[name] = feature_importance
    
    # printing models performances
    model_performance_df = pd.DataFrame(list(model_performance.items()), columns=['Model', 'RMSE'])
    print(model_performance_df)
    for name, importance in feature_importance_dict.items():
        print(f"Importance des features pour {name}:")
        print(importance.head(10))
        print("\n")
    
                   Model      RMSE
    0  Linear Regression  2.425287
    1   Ridge Regression  2.424914
    2   Lasso Regression  2.528947
    3      Decision Tree  3.326711
    4      Random Forest  2.359572
    5  Gradient Boosting  2.339490
    Importance des features pour Linear Regression:
                                     Importance
    discharge_disposition_id_27        3.544452
    discharge_disposition_id_23        1.884468
    discharge_disposition_id_18        0.788320
    diag_1_Neoplasms                   0.754714
    admission_type_id_Trauma Center    0.665000
    discharge_disposition_id_Other     0.620263
    diag_1_Pregnancy                   0.491586
    diag_1_Other                       0.370570
    admission_source_id_Other          0.336717
    admission_type_id_Emergency        0.275796
    
    
    Importance des features pour Ridge Regression:
                                     Importance
    discharge_disposition_id_27        2.344777
    discharge_disposition_id_23        1.848584
    discharge_disposition_id_18        0.762514
    diag_1_Neoplasms                   0.754616
    admission_type_id_Trauma Center    0.611150
    discharge_disposition_id_Other     0.594723
    diag_1_Pregnancy                   0.485953
    diag_1_Other                       0.370963
    admission_source_id_Other          0.336610
    admission_type_id_Emergency        0.265064
    
    
    Importance des features pour Lasso Regression:
                        Importance
    num_medications       0.129605
    num_lab_procedures    0.034375
    age                   0.014310
    diabetesMed_No        0.000000
    change_No            -0.000000
    diag_1_Pregnancy      0.000000
    diag_1_Respiratory   -0.000000
    diag_2_Circulatory   -0.000000
    diag_2_Diabetes      -0.000000
    diag_2_Digestive      0.000000
    
    
    Importance des features pour Decision Tree:
                                       Importance
    num_medications                      0.265303
    num_lab_procedures                   0.176683
    age                                  0.052088
    num_procedures                       0.051753
    number_diagnoses                     0.040806
    number_inpatient                     0.016910
    discharge_disposition_id_Referral    0.015537
    number_outpatient                    0.015141
    diag_3_Other                         0.014773
    diag_2_Other                         0.013906
    
    
    Importance des features pour Random Forest:
                                       Importance
    num_medications                      0.271208
    num_lab_procedures                   0.174330
    age                                  0.052580
    num_procedures                       0.050526
    number_diagnoses                     0.040372
    number_inpatient                     0.017384
    discharge_disposition_id_Referral    0.016276
    diag_1_Other                         0.014671
    number_outpatient                    0.013804
    diag_3_Other                         0.013477
    
    
    Importance des features pour Gradient Boosting:
                                       Importance
    num_medications                      0.552992
    num_lab_procedures                   0.206317
    num_procedures                       0.040134
    discharge_disposition_id_Referral    0.033622
    age                                  0.025623
    number_diagnoses                     0.019056
    admission_type_id_Elective           0.018685
    diag_1_Other                         0.013985
    diag_1_Neoplasms                     0.010191
    discharge_disposition_id_Other       0.008476
    
    
    

    a1. Little verification for the most important feature

    In [85]:
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    
    sampled_data = dataForReg.sample(500, random_state=42)
    
    fig = plt.figure(figsize=(15, 10))
    
    ax1 = fig.add_subplot(221, projection='3d')
    ax1.scatter(sampled_data['num_medications'], sampled_data['num_lab_procedures'], sampled_data['time_in_hospital'], alpha=0.5)
    ax1.set_title('Perspective 1')
    ax1.set_xlabel('num_medications')
    ax1.set_ylabel('num_lab_procedures')
    ax1.set_zlabel('time_in_hospital')
    ax1.view_init(elev=20, azim=30)
    
    ax2 = fig.add_subplot(222, projection='3d')
    ax2.scatter(sampled_data['num_medications'], sampled_data['num_lab_procedures'], sampled_data['time_in_hospital'], alpha=0.5)
    ax2.set_title('Perspective 2')
    ax2.set_xlabel('num_medications')
    ax2.set_ylabel('num_lab_procedures')
    ax2.set_zlabel('time_in_hospital')
    ax2.view_init(elev=20, azim=-60)
    
    ax3 = fig.add_subplot(223, projection='3d')
    ax3.scatter(sampled_data['num_medications'], sampled_data['num_lab_procedures'], sampled_data['time_in_hospital'], alpha=0.5)
    ax3.set_title('Perspective 3')
    ax3.set_xlabel('num_medications')
    ax3.set_ylabel('num_lab_procedures')
    ax3.set_zlabel('time_in_hospital')
    ax3.view_init(elev=5, azim=90)
    
    ax4 = fig.add_subplot(224, projection='3d')
    ax4.scatter(sampled_data['num_medications'], sampled_data['num_lab_procedures'], sampled_data['time_in_hospital'], alpha=0.5)
    ax4.set_title('Perspective 4')
    ax4.set_xlabel('num_medications')
    ax4.set_ylabel('num_lab_procedures')
    ax4.set_zlabel('time_in_hospital')
    ax4.view_init(elev=30, azim=150)
    
    plt.show()
    
    In [86]:
    model_performance_df = pd.DataFrame(list(model_performance.items()), columns=['Model', 'RMSE'])
    
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Model', y='RMSE', data=model_performance_df, palette='viridis')
    plt.title('Performance des modèles de régression')
    plt.ylabel('RMSE')
    plt.xlabel('Modèle')
    plt.xticks(rotation=45)
    plt.show()
    

    Results :

    Above, we observe that the Gradient Boosting approach appears to be the most effective, although it is closely trailed by the other methods, with the exception of the decision tree, which performs notably worse. Regarding the most influential parameters, we can confidently assert that they are 'num_medications' and 'num_lab_procedures.' This correlation makes perfect sense, as a higher count of lab procedures is associated with a prolonged hospital stay. However, it comes as a surprise to us that the number of medications is not ranked higher; this finding is unexpected.

    b. Improve the data with PCA

    In [87]:
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    
    X = dataForReg.drop("time_in_hospital", axis=1)
    X = X.drop("health", axis=1)
    X = pd.get_dummies(X)
    y = dataForReg["time_in_hospital"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        ('pca', PCA(n_components=0.95)),  
        ('regressor', LinearRegression())
    ])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE with PCA: {rmse}")
    
    RMSE with PCA: 2.46990744845903
    

    Observations :

    Here we can observe that the PCA got a bad score (2.45) compared to without (2.40). This mean that the PCA do not represent well the data.

    c. Find hyperparameters for the best model (Gradient boosting)

    In [88]:
    param_grid = {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 4, 5],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    
    param_grid = {
        'n_estimators': [50, 100],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 4],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    }
    
    #Gradient Boosting Regressor
    gb_regressor = GradientBoostingRegressor(random_state=42)
    
    # Create the GridSearchCV object
    grid_search = GridSearchCV(
        estimator=gb_regressor,
        param_grid=param_grid,
        scoring='neg_mean_squared_error',  #appropriate scoring metric
        cv=5,  
        n_jobs=-1  
    )
    
    grid_search.fit(X_train_scaled, y_train)
    print("Best Parameters: ", grid_search.best_params_)
    
    best_gb_model = grid_search.best_estimator_
    
    y_pred_grid_search = best_gb_model.predict(X_test_scaled)
    
    # Calculate RMSE
    rmse_grid_search = np.sqrt(mean_squared_error(y_test, y_pred_grid_search))
    print("RMSE with Grid Search: ", rmse_grid_search)
    
    testet
    Best Parameters:  {'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
    RMSE with Grid Search:  2.321570509543093
    

    Observations :

    Here we observe that with hyperparametrisation we got a good improvement : from 2.469 to 2.3216.

    6. Conclusion

    We have now studied this dataset from the Machine Learning Models point of view, and can conclude to good results with the information we have extracted.

    Reflexions

    With the obtained results, we can already make some previsions that lean more towards preventive measures than pinpoint predictions. This shift from a purely reactive to a more preventive stance in healthcare could significantly change how we approach chronic diseases like diabetes. It emphasizes the importance of early intervention and continuous, personalized care to manage the disease more effectively.

    In conclusion, while our study has made considerable progress in predicting patient readmission, it also highlights the need for further research. This research should aim to uncover the less visible variables affecting diabetes, improve measurement methods, and ultimately, guide us towards more effective and personalized healthcare solutions.


    Thank you for reading

    Oscar Pastural - Capucine Boudin - Richard Goudelin